Abstract

This article focuses on signal classification for deep-sea acoustic neutrino detection. In the deep sea, the background of
transient signals is very diverse. Approaches like matched filtering are not sufficient to distinguish between neutrino-like
signals and other transient signals with a similar signature, which form the acoustic background for neutrino
detection in the deep-sea environment. A classification system based on machine learning algorithms is analysed with
the goal of finding a robust and effective way to perform this task. For a well-trained model, a testing error on the level
of one percent is achieved for strong classifiers like Random Forest and Boosting Trees, using the extracted features
of the signal as input and utilising dense clusters of sensors instead of single sensors.
1. Introduction

Essential for the feasibility of acoustic neutrino detection is a good understanding of the background of transient 
acoustic signals in the deep sea and the ability to suppress them or identify them as background. The transient signals 
are very diverse and originate from anthropogenic and biological sources as well as weather-correlated sources. The 
aim of the AMADEUS project [1] is to investigate the method of acoustic neutrino detection. AMADEUS is integrated
into the ANTARES neutrino telescope [2], which is located in the Mediterranean Sea. The acoustic set-up consists
of six clusters of six acoustic sensors each. The spacing between the sensors within a cluster is about 1 m, and that
between the clusters is up to 350 m. In the experiment, transient signals with bipolar (i.e. neutrino-like) content are
selected using on-line filtering techniques. As the variety of recorded transient signals is still high, an effective
classification scheme to discriminate between background and neutrino-like signals is investigated and presented here.
The analysis chain incorporates a simulation of transient signals, a filter analogous to the one used on-line in the 
experiment, feature extraction algorithms and the signal classification based on machine learning algorithms. 

2. Method 

The goal of this research is to find a robust and well-performing system to distinguish between neutrino-like and
other transient signals occurring in the deep sea, such as those from man-made and biological sources. In this section,
the methods used for training and testing the classification system are explained.

2.1. Simulation 

A special-purpose simulation was designed for testing the feature extraction and classification system, which is
also trained with simulated data. The simulation is capable of generating typical deep-sea signals, i.e. waveforms present
at the ANTARES site such as bipolar and multi-polar pulses, echoes of the ANTARES acoustic positioning system or
random signals. The different signal types are generated following a uniform frequency distribution. Starting from
random source positions within a given volume around the detector, the signals are propagated to the sensors and 
characteristic ambient noise of different sea levels is added. The output - a continuous data stream - is directed to the 
filter and from there to the feature extraction or directly to the classification system. 
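
The simulation chain described above can be illustrated with a minimal sketch. It is not the simulation used in this work: the pulse shape (derivative of a Gaussian), the sound speed of 1500 m/s, the 1/r amplitude scaling, the cubic source volume and the white Gaussian noise are all assumptions made for illustration, and the function names are hypothetical. Real ambient noise in the sea is not white.

```python
import math
import random

SOUND_SPEED = 1500.0  # m/s, rough value for sea water (assumption)


def bipolar_pulse(n_samples, center, width, amplitude=1.0):
    """Bipolar pulse modelled as the negative derivative of a Gaussian.

    The factor exp(0.5) normalises both peaks to +/- amplitude.
    """
    pulse = []
    for i in range(n_samples):
        t = (i - center) / width
        pulse.append(-amplitude * t * math.exp(0.5 - 0.5 * t * t))
    return pulse


def propagate(source, sensor):
    """Return (delay in seconds, 1/r amplitude factor) for a point source."""
    r = max(math.dist(source, sensor), 1.0)  # clamp to avoid the 1/r singularity
    return r / SOUND_SPEED, 1.0 / r


def simulate_event(sensor, rng, half_size=500.0, noise_sigma=0.1, n_samples=256):
    """Draw a random source in a cube around the sensor and return a noisy waveform."""
    source = tuple(s + rng.uniform(-half_size, half_size) for s in sensor)
    delay, gain = propagate(source, sensor)
    # Pulse amplitude of 1 at a reference distance of 100 m (assumption).
    clean = bipolar_pulse(n_samples, n_samples // 2, width=5.0, amplitude=100.0 * gain)
    noisy = [x + rng.gauss(0.0, noise_sigma) for x in clean]
    return noisy, delay
```

Calling simulate_event((0.0, 0.0, 0.0), random.Random(0)) returns one waveform window centred on the pulse together with the propagation delay. Changing noise_sigma mimics different ambient noise conditions.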

2.2. Filtering and Feature Extraction 

As a first step, the incoming continuous data stream is subjected to a filter system equivalent to the one used in the 
experiment, where it is used to reduce the amount of data stored for off-line classification and reconstruction. The filter 
set-up consists of an amplitude threshold for strong transient signals, which is self-adjusting to the changing ambient 
noise conditions, and a matched filter for bipolar signals [3]. The reference signal for the matched filter is a bipolar pulse
corresponding to the one produced by a 10^20 eV shower at a distance of 300 m perpendicular to the
shower axis [4]. In a next step, the characteristics of the filtered signals are extracted. The resulting feature vector
contains the time and frequency domain characteristics of the signal as well as the results of a matched filter bank,
which was tuned for neutrino-like signals. The bank consists of six reference signals corresponding to angles of 90° to
96° in one-degree steps to the shower axis of a 10^20 eV shower at a distance of 300 m. In the time domain, the number
of occurring peaks and the peak-to-peak amplitude of the largest peak, its asymmetry and duration are extracted. In 
the frequency domain, the main frequency component and the excess over the noise background are used as features. 
From the results of the matched filter bank, the best match is taken into account. From this matched filter output the 
number of peaks and the amplitude, the width and the integral of the largest peak are stored in the feature vector. As 
an independent feature vector, the filtered waveform itself can be subjected to the classification algorithm. 
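
As an illustration of the matched filter and the time-domain features, a minimal sketch follows. The template (derivative of a Gaussian) stands in for the simulated shower pulse, and the definitions of asymmetry and duration are assumptions chosen for the example; the exact definitions used in this work are not given in the text. All function names are hypothetical.

```python
import math


def bipolar_template(half_len=15, width=5.0):
    """Reference bipolar pulse with unit peak amplitudes (assumed shape)."""
    return [-(i / width) * math.exp(0.5 - 0.5 * (i / width) ** 2)
            for i in range(-half_len, half_len + 1)]


def matched_filter(signal, template):
    """Cross-correlate signal with template, normalised by the template energy."""
    energy = sum(t * t for t in template)
    n, m = len(signal), len(template)
    out = []
    for lag in range(n - m + 1):
        acc = 0.0
        for k in range(m):
            acc += signal[lag + k] * template[k]
        out.append(acc / energy)
    return out


def time_domain_features(signal):
    """Peak-to-peak amplitude, asymmetry and duration of the largest excursion."""
    i_max = max(range(len(signal)), key=lambda i: signal[i])
    i_min = min(range(len(signal)), key=lambda i: signal[i])
    pos, neg = signal[i_max], -signal[i_min]
    total = pos + neg
    return {
        "peak_to_peak": total,
        # 0 for a symmetric bipolar pulse, +/-1 for a purely unipolar one
        "asymmetry": (pos - neg) / total if total else 0.0,
        # distance in samples between positive and negative extremum
        "duration": abs(i_min - i_max),
    }
```

With this normalisation, the filter output at the correct lag equals the amplitude of the embedded pulse relative to the template. A filter bank can be sketched by applying matched_filter with several templates and keeping the output with the largest maximum.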

2.3. Classification 

The classification system is based on machine learning algorithms [5], trained and tested with data from the simulation.
As input, either the extracted feature vector or the filtered waveform is used; as output, either binary class
labels (bipolar or not) or multiple class labels (one for each signal type in the simulation data) are predicted. The
following algorithms [6] have been investigated for individual sensors and clusters of sensors:

• Naive Bayes: This simple classification model is based on applying Bayes' theorem and assuming that the
features are conditionally independent of one another for each class. For a given feature vector, the class is
selected using probabilities estimated from the training data.

• Decision Tree: This classification model is based on a tree-like structured set of rules. Starting at the root, the
tree splits at each node based on the input variable with the highest information gain. The path from the
root of the tree to one of the leaves, which represent the class labels, defines one rule.

• Random Forest: A Random Forest is a collection of decision trees. The classification works as follows: The 
Random Forest takes the input feature vector, makes a prediction with every tree in the forest, and outputs the 
class label that received the majority of votes. The trees in the forest are trained with different subsets of the 
original training data. 

• Boosting Trees: They combine the performance of many so-called weak classifiers to produce a powerful 
classification scheme. A weak classifier is only required to be better than chance. Many of them smartly 
combined, however, result in a strong classifier. Decision trees are used as weak classifiers in this boosting 
scheme. In contrast to a Random Forest, the decision trees are not necessarily full-grown trees. 

• Support Vector Machine: An SVM maps feature vectors into a higher-dimensional space. A hyper-plane is
searched for such that the margin between this hyper-plane and the nearest feature vectors from both of the two
labels of a binary class is maximal.
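
The voting principle of the Random Forest described in the list above can be sketched in a few lines. This is a much-simplified stand-in, not the implementation used in this work: the trees are reduced to single-split decision stumps, each split is chosen by minimum training error instead of information gain, and no random feature subsets are drawn. Each stump is trained on a bootstrap sample of the training data. All names are hypothetical.

```python
import random
from collections import Counter


def predict_stump(stump, sample):
    """Single-split rule: class 1 if the feature is above (or not above) the threshold."""
    feature, threshold, polarity = stump
    above = sample[feature] > threshold
    return 1 if (above if polarity == 1 else not above) else 0


def train_stump(samples, labels):
    """Exhaustively pick the (feature, threshold, polarity) with the fewest errors."""
    best_errors, best_stump = None, None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                errors = sum(predict_stump(stump, s) != y
                             for s, y in zip(samples, labels))
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump


def train_forest(samples, labels, n_trees=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(samples)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(train_stump([samples[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest


def predict_forest(forest, sample):
    """Output the class label that received the majority of votes."""
    votes = Counter(predict_stump(stump, sample) for stump in forest)
    return votes.most_common(1)[0][0]
```

An odd number of trees avoids ties in the binary case.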

The algorithms used for Boosting Trees and SVM are restricted to binary class labels as output. The same training
and testing data sets are used for the different algorithms. The predictions for the individual sensors are combined into
a new feature vector, which is used as input to train and test the models for the clusters of sensors.
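
The combination of individual sensor predictions into a cluster-level input can be sketched as follows. The cluster-level model shown here is a deliberately simple vote-count threshold fitted on training data; it is an assumption for illustration and replaces the Random Forest and Boosting Trees models actually trained on the cluster feature vectors. All names are hypothetical.

```python
def cluster_feature_vector(sensor_predictions):
    """Stack the binary predictions of the sensors of one cluster into one vector."""
    return [int(p) for p in sensor_predictions]


def train_vote_threshold(cluster_vectors, labels):
    """Find the minimum number of positive sensor votes that minimises training error."""
    n_sensors = len(cluster_vectors[0])
    best_k, best_err = 0, None
    for k in range(n_sensors + 2):
        err = sum((sum(v) >= k) != bool(y)
                  for v, y in zip(cluster_vectors, labels))
        if best_err is None or err < best_err:
            best_k, best_err = k, err
    return best_k


def predict_cluster(k, cluster_vector):
    """Classify an event as bipolar (1) if at least k sensors voted for it."""
    return 1 if sum(cluster_vector) >= k else 0
```

For a cluster of six sensors, each event yields a six-component vector of per-sensor labels, so a single misclassifying sensor can be outvoted at cluster level.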
3. Results



In this section, the performance of the classification system is described. Two indicators are used to
measure the performance of the classification: the testing error, which is the error of the prediction with respect to the
simulation truth, and the success of training, which is the ratio between testing error and training error and indicates
whether the model is under-trained (< 1) or over-trained (> 1). As an overall result, multiple class labels as output
are less effective than the binary ones by more than a factor of two. Binary class labels are therefore the standard output
for the results presented in the following. Weak classifiers like Naive Bayes and Decision Trees show a high testing error
above 14 % and are neither more robust against changing ambient noise conditions nor significantly faster than other
classifiers (cf. Fig. 1). Although the SVM is a strong classifier, its high numerical complexity and lack of robustness
disqualify it (cf. Fig. 2). Thus the most favourable classifiers are Random Forest and Boosting Trees. In addition, the usage of
clusters shows a substantial improvement over individual sensors. Random Forest and Boosting Trees are robust and
produce well-trained models. The elapsed time for processing one event is less than a second. For the individual
sensors and the extracted features as input, a testing error of about 5 % is achieved for the Boosting Trees and of about
10 % for the Random Forest, which is further improved by more than a factor of 4 by combining the sensors into
clusters, with errors well below 1 % (cf. Fig. 3 and Fig. 4). Using the extracted waveform as input yields similar results:
the Random Forest achieves a testing error of about 6 % and the Boosting Trees of about 12 %. These errors are also
improved by a factor of 4 when combining the individual sensors into clusters (cf. Fig. 5 and Fig. 6).
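
The two performance indicators defined at the beginning of this section can be written down as a short sketch. The function names are hypothetical, and the handling of a vanishing training error is an assumption, since the text does not specify it.

```python
def error_rate(predictions, truth):
    """Fraction of predictions that disagree with the simulation truth."""
    wrong = sum(p != t for p, t in zip(predictions, truth))
    return wrong / len(truth)


def success_of_training(test_error, train_error):
    """Ratio of testing error to training error.

    Values below 1 indicate an under-trained model, values above 1 an
    over-trained one; a value of about 1 indicates a well-trained model.
    """
    if train_error == 0:
        # Convention chosen for this sketch only.
        return 1.0 if test_error == 0 else float("inf")
    return test_error / train_error
```

For example, a training error of 5 % together with a testing error of 6 % gives a ratio of 1.2, i.e. a slightly over-trained model.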
Figure 1: The testing error is shown as a function of the training samples for Decision Tree, Naive Bayes and SVM classifiers. As input, the extracted
feature vector is used and the binary class labels as output for individual sensors. 
Figure 2: The success of the training is shown as a function of the training samples for Decision Tree, Naive Bayes and SVM classifiers. As input, the
extracted feature vector is used and the binary class labels as output for individual sensors. A value of one indicates that the model is well-trained. 



4. Conclusion 

The results show that machine learning algorithms are a promising way to find a robust, effective and efficient 
classification system. The classifiers perform well under different levels of ambient noise and are able to distinguish 
between bipolar (i.e. neutrino-like) and other signals, especially to differentiate them from short multi-polar signals. 
This is necessary for the further analysis of neutrino-like events, i.e. the search for the specific pancake
shape of the spatial pressure distribution from a neutrino interaction.



5. Outlook 

In a next step, the classification system will be tested against data from the experiment. If the performance
matches the simulation results, it will be used to perform an analysis of the temporal and spatial distribution of
the background of bipolar signals. The system will then be extended towards classifying neutrino-like events with all 
their features, in particular their disk-like spatial propagation. 
Figure 3: The testing error is shown as a function of the training samples for Random Forest and Boosting Trees classifiers. As input, the extracted
feature vector is used and the binary class labels as output for individual sensors and clusters of sensors (indicated by "clustered"). 
Figure 4: The success of the training is shown as a function of the training samples for Random Forest and Boosting Trees classifiers. As input,
the extracted feature vector is used and the binary class labels as output for individual sensors and clusters of sensors (indicated by
"clustered"). A value of one indicates that the model is well-trained. 
Figure 5: The testing error is shown as a function of the training samples for Random Forest and Boosting Trees classifiers. As input, the extracted
waveform of the signal is used and the binary class labels as output for individual sensors and clusters of sensors (indicated by "clustered"). 
Figure 6: The success of the training is shown as a function of the training samples for Random Forest and Boosting Trees classifiers. As input,
the extracted waveform of the signal is used and the binary class labels as output for individual sensors and clusters of sensors (indicated by 
"clustered"). A value of one indicates that the model is well-trained. 